Current Issue: October - December 2013, Issue Number 4, Articles: 4
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out events that are unlikely in the given context. We propose a similar use of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models, and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is produced by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate sound event detection performance at various levels of polyphony. It combines detection accuracy and a coarse time-resolution error into a single metric, making the comparison of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, detection accuracy can be almost doubled by using the proposed context-dependent event detection.
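To make the two-stage idea concrete, the sketch below shows stage 1 as GMM context scoring and stage 2 as a plain Viterbi decode over context-specific event models. It is a minimal sketch only: the CONTEXT_EVENTS mapping, model objects, and feature shapes are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of context-dependent sound event detection (assumed interfaces).
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical context-to-event-set mapping (assumption, for illustration only).
CONTEXT_EVENTS = {
    "street": ["car", "footsteps", "speech"],
    "office": ["keyboard", "speech", "phone"],
}

def recognize_context(features, context_gmms):
    """Stage 1: pick the context whose GMM gives the highest average log-likelihood."""
    scores = {c: gmm.score(features) for c, gmm in context_gmms.items()}
    return max(scores, key=scores.get)

def viterbi_events(loglik, transition_loglik):
    """Stage 2 (monophonic variant): Viterbi decoding over event classes.

    loglik: (T, K) frame log-likelihoods for K context-specific event models.
    transition_loglik: (K, K) log transition weights (can encode count-based priors).
    Returns the most likely event index per frame.
    """
    T, K = loglik.shape
    delta = np.full((T, K), -np.inf)
    psi = np.zeros((T, K), dtype=int)
    delta[0] = loglik[0]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + transition_loglik
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```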
This article analyzes and compares the influence of different types of spectral and prosodic features on Czech and Slovak emotional speech classification based on Gaussian mixture models (GMM). The influence of the initial settings of the GMM training process (number of mixture components and number of iterations) was also analyzed. Subsequently, an analysis was performed to determine how the correctness of emotion classification depends on the number and order of the parameters in the input feature vector and on the computational complexity. Another test verified the functionality of the proposed two-level architecture comprising a gender recognizer and an emotional speech classifier. Further tests examined how negative factors (an input speech signal of too short duration, incorrectly determined speaker gender, etc.) affect the stability of the results generated by the GMM classification process. Evaluations and tests were carried out on speech material in the form of sentences by male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in the Czech and Slovak languages. In addition, a comparative experiment was performed using a speech corpus in another language (German). The mean classification error rate of the whole classifier structure is about 21% over all four emotions and both genders, and the best obtained error rate was 3.5% for the sadness style of the female gender. These values are acceptable at this first stage of development of the GMM classifier. On the other hand, the tests showed the principal importance of correctly classifying the speaker gender in the first level, which has a strong influence on the resulting recognition score of the emotion classification. This GMM classifier is intended to be used for evaluating synthetic speech quality after voice conversion and emotional speech style transformation.
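A minimal sketch of the two-level classification idea (gender first, then gender-specific emotion GMMs) is given below. It assumes pre-extracted spectral/prosodic feature vectors; the model dictionaries, parameter values, and label names are illustrative rather than the settings used in the article.

```python
# Two-level GMM classification sketch: gender recognizer, then emotion classifier.
from sklearn.mixture import GaussianMixture

EMOTIONS = ["joy", "sadness", "anger", "neutral"]

def train_gmm(features, n_components=16, n_iter=100):
    """Train one GMM per class; n_components and n_iter stand in for the
    initial training settings whose influence the article analyzes."""
    gmm = GaussianMixture(n_components=n_components, max_iter=n_iter,
                          covariance_type="diag")
    gmm.fit(features)
    return gmm

def classify_emotion(features, gender_gmms, emotion_gmms):
    """Level 1: decide the speaker gender; level 2: pick the emotion with the
    highest log-likelihood among that gender's emotion models.

    gender_gmms: e.g. {"male": gmm, "female": gmm}
    emotion_gmms: e.g. {"male": {"joy": gmm, ...}, "female": {...}}
    """
    gender = max(gender_gmms, key=lambda g: gender_gmms[g].score(features))
    models = emotion_gmms[gender]
    emotion = max(models, key=lambda e: models[e].score(features))
    return gender, emotion
```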
The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning algorithm, where the unlabeled data can be different but nevertheless have similar structure. First, a representation is learned from the unlabeled samples by decomposing their data matrix into two matrices, called the bases matrix and the activations matrix. This procedure is justified by the assumption that each sample is a linear combination of the columns of the bases matrix, which can be viewed as high-level features representing the knowledge learned from the unlabeled data in an unsupervised way. Next, activations of the labeled data are obtained using the bases, which are kept fixed. Finally, a classifier is built using these activations instead of the original labeled data. In this work, we investigated the performance of three popular matrix decomposition methods: Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), and Sparse Coding (SC), as unsupervised high-level feature extractors for the self-taught learning algorithm. We implemented this algorithm for the music genre classification task using two different databases: one as the unlabeled data pool and the other as data for supervised classifier training. Music pieces come from 10 and 6 genres for each database, respectively, and only one genre is common to both. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved over a wide range of unlabeled data sizes. The best performance among the matrix decomposition approaches was achieved by the Sparse Coding method.
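The self-taught learning pipeline described above can be sketched as follows, using scikit-learn's NMF as one of the three decomposition options. The feature matrices, basis count, and classifier choice are assumptions for illustration, not the article's exact setup.

```python
# Self-taught learning sketch: learn bases on unlabeled data, encode labeled
# data as activations over the fixed bases, then train a supervised classifier.
from sklearn.decomposition import NMF
from sklearn.svm import LinearSVC

def learn_bases(unlabeled, n_bases=128):
    """Decompose the unlabeled data matrix (n_samples, n_features, non-negative)
    into bases and activations; only the bases are kept as high-level features."""
    nmf = NMF(n_components=n_bases, max_iter=300)
    nmf.fit(unlabeled)
    return nmf            # nmf.components_ holds the bases

def encode(nmf, labeled):
    """Project labeled samples onto the fixed bases to obtain their activations."""
    return nmf.transform(labeled)

def train_genre_classifier(nmf, labeled, labels):
    """Train the classifier on activations instead of the original labeled data."""
    activations = encode(nmf, labeled)
    clf = LinearSVC()
    clf.fit(activations, labels)
    return clf
```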
Blind source separation (BSS) and sound activity detection (SAD) from a sound source mixture with minimal prior information are two major requirements for computational auditory scene analysis, which recognizes auditory events in many environments. In daily environments, BSS suffers from many problems, such as reverberation, the permutation problem in frequency-domain processing, and uncertainty about the number of sources in the observed mixture. While many conventional BSS methods resort to a cascaded combination of subprocesses, e.g., frequency-wise separation and permutation resolution, to overcome these problems, their outcomes may be affected by the worst subprocess. Our aim is to develop a unified framework to cope with these problems. Our method, called permutation-free infinite sparse factor analysis (PF-ISFA), is based on a nonparametric Bayesian framework that enables inference without a predetermined number of sources. It solves BSS, SAD, and the permutation problem at the same time. Our method has two key ideas: unified source activities across all frequency bins, and activation probabilities over all frequency bins for all sources. Experiments were carried out to evaluate the separation and SAD performance under four reverberant conditions. In terms of the BSS EVAL criteria, our method outperformed conventional complex ISFA under all conditions. For SAD performance, our method outperformed the conventional method by 5.9% to 0.5% in F-measure under the conditions RT20 = 30 to 600 ms, respectively.
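The key modeling idea, a single set of source activities shared by all frequency bins, can be illustrated with the small generative sketch below. It only builds synthetic data with that structure to show why a shared activity removes the frequency-wise permutation ambiguity; it does not implement the PF-ISFA inference itself, and all sizes and values are placeholders.

```python
# Generative structure with a frequency-independent activity mask per source.
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_src, n_mic = 4, 100, 2, 3

# Shared (frequency-independent) binary activity per source and frame.
activity = rng.random((n_src, n_frames)) < 0.3          # (sources, frames)

# Frequency-dependent complex source spectra and mixing matrices.
sources = rng.standard_normal((n_freq, n_src, n_frames)) \
        + 1j * rng.standard_normal((n_freq, n_src, n_frames))
mixing = rng.standard_normal((n_freq, n_mic, n_src)) \
       + 1j * rng.standard_normal((n_freq, n_mic, n_src))

# Observed mixture: every bin gates its sources with the same activity,
# so activity inferred in one bin is consistent with all the others.
observed = np.einsum("fms,fst->fmt", mixing, sources * activity[None])
print(observed.shape)   # (n_freq, n_mic, n_frames)
```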